empirical investigation
The Ladder in Chaos: Improving Policy Learning by Harnessing the Parameter Evolving Path in A Low-dimensional Space
Hongyao Tang
Deep Reinforcement Learning (DRL) is far from well understood, although its great potential has been demonstrated with a lot of achievements in different practical problems [Badia et al., 2020, Shah et al., 2022, Fawzi et al., 2022, Degrave et al., 2022, OpenAI, 2022]. Consistent efforts are made to gain a better understanding of the learning dynamics of RL agents.
Can machines perform a qualitative data analysis? Reading the debate with Alan Turing
This paper reflects on the literature that rejects the use of Large Language Models (LLMs) in qualitative data analysis. It illustrates through empirical evidence as well as critical reflections why the current critical debate is focusing on the wrong problems. The paper proposes that the focus of researching the use of the LLMs for qualitative analysis is not the method per se, but rather the empirical investigation of an artificial system performing an analysis. The paper builds on the seminal work of Alan Turing and reads the current debate using key ideas from Turing's "Computing Machinery and Intelligence". This paper therefore reframes the debate on qualitative analysis with LLMs and states that rather than asking whether machines can perform qualitative analysis in principle, we should ask whether with LLMs we can produce analyses that are sufficiently comparable to human analysts. In the final part the contrary views to performing qualitative analysis with LLMs are analysed using the same writing and rhetorical style that Turing used in his seminal work, to discuss the contrary views to the main question.
An Empirical Investigation of Domain Generalization with Empirical Risk Minimizers (Appendix)
See Table 1 for the results. We next perform regression in the Joint setting (Sec. 5.3, main paper) where we fit a regression model across all environments, with 5 features instead of 2 reported in the main We find that it is possible to get a Spearman's We considered a set of 40 metrics overall and report only a small subset of them in the main paper. In Table 2 we provide detailed results of all the measures we study. Figure 1 provides details of the canonicalization performed on each of the measures as explained in the main paper. In particular, (Ben-David et al., 2007) prove We also develop measures based on follow-up theoretical work in (Ben-David et al., 2010) on divergence measures using the symmetric difference hypothesis space. Here we summarize a result from (Ben-David et al., 2010), This canonicalization is used to report the results in Sec. 5 H: Z P (Y), we follow the steps in Algorithm 1. Algorithm 1 Computing H-divergence measure As explained in the main paper, this divergence measure was proposed in (Ben-David et al., 2010).
An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case
Giachino, Gioele, Rondina, Marco, Vetrò, Antonio, Coppola, Riccardo, De Martin, Juan Carlos
The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs' ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected a total of 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of 'she' pronouns with the 'assistant' rather than the 'manager'. The presence of bias in AI-generated text can have significant implications in many fields, such as in the workplace or in job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and ensuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods, or further exploiting a larger experimental base.
An Empirical Investigation of Domain Generalization with Empirical Risk Minimizers
Recent work demonstrates that deep neural networks trained using Empirical Risk Minimization (ERM) can generalize under distribution shift, outperforming specialized training algorithms for domain generalization. The goal of this paper is to further understand this phenomenon. In particular, we study the extent to which the seminal domain adaptation theory of Ben-David et al. (2007) explains the performance of ERMs. Perhaps surprisingly, we find that this theory does not provide a tight explanation of the out-of-domain generalization observed across a large number of ERM models trained on three popular domain generalization datasets. This motivates us to investigate other possible measures--that, however, lack theory--which could explain generalization in this setting.
Introducing Milabench: Benchmarking Accelerators for AI
Delaunay, Pierre, Bouthillier, Xavier, Breuleux, Olivier, Ortiz-Gagné, Satya, Bilaniuk, Olexa, Normandin, Fabrice, Bergeron, Arnaud, Carrez, Bruno, Alain, Guillaume, Blanc, Soline, Osterrath, Frédéric, Viviano, Joseph, Creus-Castanyer, Roger, Patil, Darshan, Awal, Rabiul, Zhang, Le
AI workloads, particularly those driven by deep learning, are introducing novel usage patterns to high-performance computing (HPC) systems that are not comprehensively captured by standard HPC benchmarks. As one of the largest academic research centers dedicated to deep learning, Mila identified the need to develop a custom benchmarking suite to address the diverse requirements of its community, which consists of over 1,000 researchers. This report introduces Milabench, the resulting benchmarking suite. Its design was informed by an extensive literature review encompassing 867 papers, as well as surveys conducted with Mila researchers. This rigorous process led to the selection of 26 primary benchmarks tailored for procurement evaluations, alongside 16 optional benchmarks for in-depth analysis. We detail the design methodology, the structure of the benchmarking suite, and provide performance evaluations using GPUs from NVIDIA, AMD, and Intel. The Milabench suite is open source and can be accessed at github.com/mila-iqia/milabench.
An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation
Sultan, Md Arafat, Trivedi, Aashka, Awasthy, Parul, Sil, Avirup
We present a large-scale empirical study of how choices of configuration parameters affect performance in knowledge distillation (KD). An example of such a KD parameter is the measure of distance between the predictions of the teacher and the student, common choices for which include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study on their general effect on student performance. We take an empirical approach to this question in this paper, seeking to find out the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.
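To make the example parameter above concrete, here is a minimal sketch of the two distance measures the abstract names, MSE and KL-divergence, computed between a teacher's and a student's predictions for one input. It is an illustration only, not the paper's implementation: the function names, the `temperature` parameter, and the choice to compute MSE on raw logits and KL on softmax probabilities are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse_distance(teacher_logits, student_logits):
    """Mean squared error between logits (assumption: applied to logits)."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n


def kl_distance(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over softmax probabilities.

    Assumes logits are moderate enough that no student probability
    underflows to exactly zero.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits
    student = [1.5, 0.7, -0.5]  # hypothetical student logits
    print("MSE:", mse_distance(teacher, student))
    print("KL :", kl_distance(teacher, student))
```

Both functions return zero when the student matches the teacher exactly, but they penalize disagreement differently, which is the kind of configuration choice the study compares.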
An Empirical Investigation of Value-Based Multi-objective Reinforcement Learning for Stochastic Environments
Ding, Kewen, Vamplew, Peter, Foale, Cameron, Dazeley, Richard
One common approach to solving multi-objective reinforcement learning (MORL) problems is to extend conventional Q-learning by using vector Q-values in combination with a utility function. However, issues can arise with this approach in the context of stochastic environments, particularly when optimising for the Scalarised Expected Reward (SER) criterion. This paper extends prior research, providing a detailed examination of the factors influencing the frequency with which value-based MORL Q-learning algorithms learn the SER-optimal policy for an environment with stochastic state transitions. We empirically examine several variations of the core multi-objective Q-learning algorithm as well as reward engineering approaches, and demonstrate the limitations of these methods. In particular, we highlight the critical impact of the noisy Q-value estimates issue on the stability and convergence of these algorithms.
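The general idea of vector Q-values combined with a utility function can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm variants studied in the paper: the names `greedy_action` and `update`, the tabular dictionary layout, the step size and discount defaults, and the use of `min` as a nonlinear utility are all hypothetical, and terminal-state handling is omitted.

```python
from collections import defaultdict


def greedy_action(q_table, state, actions, utility):
    """Pick the action whose vector Q-value has the highest utility."""
    return max(actions, key=lambda a: utility(q_table[(state, a)]))


def update(q_table, state, action, reward_vec, next_state, actions,
           utility, alpha=0.1, gamma=1.0):
    """One tabular update where each objective keeps its own Q estimate.

    The bootstrap action is chosen by applying the utility function to
    the vector Q-values of the next state (no terminal-state handling).
    """
    best_next = greedy_action(q_table, next_state, actions, utility)
    next_q = q_table[(next_state, best_next)]
    target = [r + gamma * qn for r, qn in zip(reward_vec, next_q)]
    old = q_table[(state, action)]
    q_table[(state, action)] = tuple(
        o + alpha * (t - o) for o, t in zip(old, target)
    )


if __name__ == "__main__":
    # Two objectives; every unseen (state, action) pair starts at zero.
    q = defaultdict(lambda: (0.0, 0.0))
    actions = ["a", "b"]
    # Hypothetical transition: state "s", action "a", vector reward (1, 2).
    update(q, "s", "a", (1.0, 2.0), "t", actions, utility=min, alpha=0.5)
    print(q[("s", "a")])
    print(greedy_action(q, "s", actions, utility=min))
```

Because the utility is applied to estimated Q-vectors, noise in those estimates can change which action looks best, which relates to the noisy Q-value estimates issue the abstract highlights.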